Appendix A  Convergence of CPED with the hybrid loss

This appendix first presents preliminaries and a formal version of Theorem 4.1 with its proof; under the stated assumptions on the discriminator class, the bound follows from the triangle inequality together with Eq. A.2, Eq. A.11, Eq. A.12, and Eq. A.14. It then proves Proposition 3.1; gives a brief proof of Theorem 4.2, showing that the learned policy can find the stationary point of the Bellman equation w.r.t. the production sample space; and gives a brief proof of Theorem 4.3, showing convergence of the learned policy, starting from the monotonic improvement of the Q function of the policy iterated by CPED. Gym-MuJoCo is a commonly used benchmark for offline RL tasks.
DRDT3: Diffusion-Refined Decision Test-Time Training Model
Huang, Xingshuai, Wu, Di, Boulet, Benoit
Decision Transformer (DT), a trajectory modeling method, has shown competitive performance compared to traditional offline reinforcement learning (RL) approaches on various classic control tasks. However, it struggles to learn optimal policies from suboptimal, reward-labeled trajectories. In this study, we explore the use of conditional generative modeling to facilitate trajectory stitching, given its high-quality data generation ability. Additionally, recent advances in Recurrent Neural Networks (RNNs) have demonstrated linear complexity and sequence modeling performance competitive with Transformers. We leverage the Test-Time Training (TTT) layer, an RNN that updates hidden states during testing, to model trajectories in the form of DT. We introduce a unified framework, called Diffusion-Refined Decision TTT (DRDT3), to achieve performance beyond DT models. Specifically, we propose the Decision TTT (DT3) module, which harnesses the sequence modeling strengths of both self-attention and the TTT layer to capture recent contextual information and make coarse action predictions. We further integrate DT3 with the diffusion model using a unified optimization objective. In experiments on multiple Gym and AntMaze tasks from the D4RL benchmark, our DT3 model without diffusion refinement demonstrates improved performance over standard DT, while DRDT3 further achieves superior results compared to state-of-the-art conventional offline RL and DT-based methods.
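The TTT layer described above maintains a hidden state that is itself a small model, updated by self-supervised gradient steps on each incoming token at test time. A minimal sketch of this idea follows; the linear inner model, shapes, and learning rate are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ttt_layer(tokens, lr=0.1):
    """Toy Test-Time Training (TTT) layer sketch: the hidden state is a
    small linear model W, updated by one gradient step of a self-supervised
    reconstruction loss for every incoming token, even at test time."""
    d = tokens.shape[1]
    W = np.zeros((d, d))                 # hidden state = weights of inner model
    outputs = []
    for x in tokens:
        pred = W @ x                     # inner model's reconstruction of x
        grad = np.outer(pred - x, x)     # gradient of 0.5*||W x - x||^2 w.r.t. W
        W = W - lr * grad                # test-time update of the hidden state
        outputs.append(W @ x)            # output token uses the updated state
    return np.stack(outputs)
```

Because the state is updated by gradient descent rather than a fixed recurrence, the reconstruction of a repeated token improves over the sequence, while the per-token cost stays linear in sequence length.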
Revisiting the Minimalist Approach to Offline Reinforcement Learning
Tarasov, Denis, Kurenkov, Vladislav, Nikulin, Alexander, Kolesnikov, Sergey
Recent years have witnessed significant advancements in offline reinforcement learning (RL), resulting in the development of numerous algorithms with varying degrees of complexity. While these algorithms have led to noteworthy improvements, many incorporate seemingly minor design choices that impact their effectiveness beyond core algorithmic advances. However, the effect of these design choices on established baselines remains understudied. In this work, we aim to bridge this gap by conducting a retrospective analysis of recent works in offline RL and propose ReBRAC, a minimalistic algorithm that integrates such design elements built on top of the TD3+BC method. We evaluate ReBRAC on 51 datasets with both proprioceptive and visual state spaces using D4RL and V-D4RL benchmarks, demonstrating its state-of-the-art performance among ensemble-free methods in both offline and offline-to-online settings. To further illustrate the efficacy of these design choices, we perform a large-scale ablation study and hyperparameter sensitivity analysis on the scale of thousands of experiments.
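ReBRAC layers its design elements on top of TD3+BC, whose actor objective maximizes Q under an adaptively scaled behavior-cloning penalty. A minimal sketch of that baseline loss, assuming the standard TD3+BC formulation (ReBRAC's additional choices, such as its critic-side penalty and decoupled coefficients, are omitted):

```python
import numpy as np

def td3_bc_actor_loss(q_values, policy_actions, data_actions, alpha=2.5):
    """TD3+BC-style actor loss: maximize Q while staying close to the
    dataset actions. `alpha` follows the TD3+BC paper; the lambda term
    normalizes the Q scale so one alpha works across environments."""
    lam = alpha / (np.mean(np.abs(q_values)) + 1e-8)    # adaptive Q scaling
    bc = np.mean((policy_actions - data_actions) ** 2)  # behavior-cloning term
    return -lam * np.mean(q_values) + bc                # minimize this loss
```

The adaptive scaling is the kind of "seemingly minor design choice" the abstract refers to: without it, the relative weight of the BC term would depend on the magnitude of the rewards in each dataset.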
When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning
Li, Jianxiong, Zhan, Xianyuan, Xu, Haoran, Zhu, Xiangyu, Liu, Jingjing, Zhang, Ya-Qin
In offline reinforcement learning (RL), one detrimental issue for policy learning is the error accumulation of the deep Q function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside the data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of the training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining the policy within the data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach over existing methods that are based solely on data distribution or support constraints.

Offline reinforcement learning (RL) provides a new possibility for learning optimized policies from large, pre-collected datasets without any environment interaction (Levine et al., 2020). This holds great promise for solving many real-world problems where online interaction is costly or dangerous yet historical data is easily accessible (Zhan et al., 2022). However, the optimization nature of RL, as well as the need for counterfactual reasoning on unseen data in the offline setting, poses great technical challenges for designing effective offline RL algorithms.
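The distance-based policy constraint described above can be sketched as a hinge penalty added to a standard actor objective. Purely for illustration, this sketch replaces DOGE's learned state-conditioned distance function with the Euclidean distance to the nearest dataset action, and treats the threshold and weight as assumed hyperparameters:

```python
import numpy as np

def doge_style_actor_objective(q_value, action, dataset_actions,
                               eps=0.5, beta=1.0):
    """DOGE-style constrained actor objective (sketch): penalize the actor
    only when its distance to the data exceeds a threshold, allowing
    exploitation in nearby (generalizable) OOD regions. The learned
    distance function is stood in for by a nearest-neighbor distance."""
    dist = np.min(np.linalg.norm(dataset_actions - action, axis=1))
    penalty = beta * max(0.0, dist - eps)  # zero penalty inside the threshold
    return -q_value + penalty              # minimize: maximize Q, stay near data
```

In contrast to a hard support constraint, actions within `eps` of the data incur no penalty at all, so the policy is free to exploit the region where the Q function still generalizes.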
Evaluating the value function outside data coverage areas can produce falsely optimistic values; without corrective information from online interaction, such estimation errors can accumulate quickly and misguide the policy learning process (Van Hasselt et al., 2018; Fujimoto et al., 2018; Kumar et al., 2019). Recent model-free offline RL methods address this error accumulation challenge in several ways: 1) Policy Constraint: directly constraining the learned policy to stay within the distribution, or the support, of the dataset (Kumar et al., 2019); 2) Value Regularization: regularizing the value function to assign low values to out-of-distribution (OOD) actions (Kumar et al., 2020b); 3) In-sample Learning: learning the value function within data samples (Kostrikov et al., 2021b) or simply treating it as the value function of the behavioral policy (Brandfonbrener et al., 2021). All three families of methods share the trait of being conservative and omitting evaluation on OOD data, which brings the benefit of minimizing model exploitation error, but at the expense of poor generalization of the learned policy in OOD regions.
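As a concrete instance of the value-regularization family, a CQL-style penalty (in the spirit of Kumar et al., 2020b) pushes down values over all actions via a log-sum-exp while pushing up the value at the dataset action. This sketch assumes a discrete action set and an illustrative weight; it would be added to the usual TD loss:

```python
import numpy as np

def cql_style_penalty(q_all_actions, q_data_action, alpha=1.0):
    """Value-regularization sketch: log-sum-exp over Q-values of all
    actions acts as a soft maximum pushed down, while the Q-value at the
    logged (in-dataset) action is pushed up, yielding conservative
    estimates for OOD actions."""
    soft_max_q = np.log(np.sum(np.exp(q_all_actions)))  # soft max over actions
    return alpha * (soft_max_q - q_data_action)
```

The penalty is smallest when the value mass concentrates on the dataset action, which is exactly the conservatism (and the generalization cost) the paragraph above describes.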